Variable length columns for CTable #624
Open
FrancescAlted wants to merge 24 commits into main from
Conversation
Pull request overview
Adds variable-length list-valued columns to CTable via a new ListSpec schema type and a new ListArray container that abstracts variable-length storage over BatchArray/VLArray, including persistence and Arrow round-tripping.
Changes:
- Introduces ListSpec and the blosc2.list(...) schema API, plus schema (de)serialization support for list columns (a hedged usage sketch follows this list).
- Adds blosc2.ListArray and integrates it into CTable storage, mutation, selection, persistence, and Arrow import/export.
- Adds SChunk.reorder_offsets() plus tests; updates docs/examples/benchmarks to reflect the new capabilities and performance tooling.
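As a quick orientation, here is a minimal sketch of what a list column might look like in practice; the blosc2.list() keyword, the schema-class style, and the data payload shapes are assumptions drawn from the summaries and tests in this PR rather than confirmed API.

```python
# Hedged sketch: the blosc2.list() arguments, the schema-class style, and the
# new_data/extend payload shapes below are assumptions inferred from this PR's
# summaries and tests, not verified API.
import numpy as np
import blosc2
from blosc2 import CTable

class Row:
    id: np.int64                            # fixed-width scalar column
    tags = blosc2.list(dtype=np.int32)      # variable-length list column (assumed signature)

t = CTable(Row, new_data={"id": [1, 2], "tags": [[10, 20, 30], []]})
t.extend({"id": [3], "tags": [[7]]})        # column-wise extend (assumed payload shape)
hits = t.where(t["id"] > 1)                 # boolean-expression selection, as in the tests below
print(t["tags"][0])                         # -> [10, 20, 30]
```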
Reviewed changes
Copilot reviewed 44 out of 44 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_schunk_reorder_offsets.py | New tests for SChunk.reorder_offsets() correctness and error handling. |
| tests/test_list_array.py | New tests for ListArray append/extend/indexing/persistence/Arrow roundtrip. |
| tests/ctable/test_varlen_schema_compiler.py | Tests for list spec building, schema compilation, and schema dict roundtrip. |
| tests/ctable/test_varlen_columns.py | New integration tests for list columns in CTable (append/extend/where/select/compact/persistence/Arrow). |
| tests/ctable/test_sort_by.py | Updates view sorting behavior tests (inplace vs copy). |
| src/blosc2/schunk.py | Adds Python-level SChunk.reorder_offsets() and open-dispatch for ListArray. |
| src/blosc2/schema_vectorized.py | Extends batch validation to validate list cells via coerce_list_cell. |
| src/blosc2/schema_validation.py | Extends row/rows validation to normalize and type list columns for Pydantic validation. |
| src/blosc2/schema_compiler.py | Adds ListSpec awareness (dtype optional), list annotation validation, and schema (de)serialization for list specs. |
| src/blosc2/schema.py | Introduces ListSpec and public blosc2.list(...) builder. |
| src/blosc2/list_array.py | New ListArray implementation over BatchArray/VLArray, including Arrow support and metadata tagging. |
| src/blosc2/dict_store.py | Adds support for persisting/discovering external ListArray leaves as .b2b. |
| src/blosc2/ctable_storage.py | Extends table storage to create/open list columns and improves index sidecar path handling for .b2z. |
| src/blosc2/ctable.py | Integrates list columns into CTable core operations (append/extend/select/compact/save/load/to_arrow/sort/copy/info). |
| src/blosc2/core.py | Adds from_cframe() dispatch for ListArray. |
| src/blosc2/blosc2_ext.pyx | Adds Cython binding for SChunk.reorder_offsets(). |
| src/blosc2/init.py | Exposes ListArray and list builder in the public API. |
| plans/ctable-varlen-cols.md | Detailed design/implementation plan for variable-length columns. |
| examples/ctable/varlen_columns.py | Example demonstrating list columns and ListArray usage. |
| examples/ctable/index_on_b2z.py | Example demonstrating index persistence across .b2z roundtrip. |
| doc/reference/list_array.rst | New reference docs for ListArray. |
| doc/reference/ctable.rst | Updates CTable docs to mention list columns and blosc2.list. |
| doc/reference/classes.rst | Adds ListArray to the documented class/module lists. |
| bench/ctable/where_selective.py | Uses perf_counter for timing. |
| bench/ctable/where_chain.py | Uses perf_counter; replaces unsupported boolean expression usage in DSL. |
| bench/ctable/varlen.py | New benchmark for varlen list columns across backends and access patterns. |
| bench/ctable/speed_iter.py | Reworks row-iteration benchmark with sampling and perf_counter. |
| bench/ctable/sort_by.py | New benchmark for sort_by() performance across scenarios. |
| bench/ctable/slice_to_array.py | Updates benchmark to use slicing directly (no .to_numpy()). |
| bench/ctable/slice_steps.py | Updates benchmark to use slicing directly (no .to_numpy()). |
| bench/ctable/slice.py | Uses perf_counter for timing. |
| bench/ctable/row_access.py | Uses perf_counter for timing. |
| bench/ctable/print.py | Reworks benchmark to compare ingestion + rendering cost with pandas using perf_counter. |
| bench/ctable/iteration_column.py | Updates benchmark to use slicing directly (no .to_numpy()). |
| bench/ctable/iter_rows.py | Reworks iteration benchmark cases and uses perf_counter. |
| bench/ctable/indexin.py | New benchmark comparing index kinds vs scan across selectivities and data layouts. |
| bench/ctable/indexin.md | Captured benchmark output for index kinds comparison. |
| bench/ctable/extend_vs_append.py | Reworks benchmark comparing append vs extend strategies with perf_counter. |
| bench/ctable/extend.py | Uses perf_counter for timing. |
| bench/ctable/expected_size.py | Uses perf_counter for timing. |
| bench/ctable/delete.py | Uses perf_counter for timing. |
| bench/ctable/ctable_v_pandas.py | Updates benchmark to use slicing directly (no .to_numpy()). |
| bench/ctable/compact.py | Uses perf_counter for timing. |
| bench/ctable/bench_persistency.py | Updates benchmark to use slicing directly (no .to_numpy()). |
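For the new SChunk.reorder_offsets() binding covered by src/blosc2/blosc2_ext.pyx and tests/test_schunk_reorder_offsets.py above, a minimal sketch of how it might be exercised is shown below. It assumes the method takes a full permutation of chunk indices, mirroring the C-level blosc2_schunk_reorder_offsets(); the Python signature is not spelled out in this PR's text.

```python
# Hedged sketch: assumes SChunk.reorder_offsets() takes a full permutation of chunk
# indices, mirroring the C-level blosc2_schunk_reorder_offsets(); not verified here.
import numpy as np
import blosc2

schunk = blosc2.SChunk(chunksize=4 * 1024)
for i in range(4):
    schunk.append_data(np.full(1024, i, dtype=np.int32))   # one full chunk per append

schunk.reorder_offsets([3, 2, 1, 0])                        # reverse the logical chunk order (assumed signature)
first = np.frombuffer(schunk.decompress_chunk(0), dtype=np.int32)
assert first[0] == 3                                        # chunk 3's data now comes first
```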
Comment on lines +289 to +294

```python
def extend(self, values: Iterable[Any], *, validate: bool = True) -> None:
    if validate:
        cells = [coerce_list_cell(self.spec, v) for v in values]
    else:
        cells = [v if v is not None else [] for v in values]
    if self.spec.storage == "vl":
```
Comment on lines 1408 to +1411

```python
def close(self) -> None:
    """Close any persistent backing store held by this table."""
    with contextlib.suppress(Exception):
        self._flush_varlen_columns()
```
Comment on lines +4806 to +4810

```python
if isinstance(data, dict):
    provided_names = set(data) & set(current_col_names)
    new_nrows = len(next(iter(data.values())))
    raw_columns = {name: data[name] for name in provided_names}
elif isinstance(data, np.ndarray) and data.dtype.names is not None:
```
Comment on lines +224 to +236

```python
def test_sort_view_inplace_raises():
    t = CTable(Row, new_data=DATA)
    view = t.where(t["id"] > 2)
    with pytest.raises(ValueError, match="view"):
        view.sort_by("id")
    with pytest.raises(ValueError, match="inplace"):
        view.sort_by("id", inplace=True)


def test_sort_view_copy_works():
    t = CTable(Row, new_data=DATA)
    view = t.where(t["id"] > 2)
    sorted_view = view.sort_by("id", ascending=False)
    ids = [sorted_view["id"][i] for i in range(len(sorted_view))]
    assert ids == sorted(ids, reverse=True)
```
- Allow CTable.where() to accept string expressions
- Support column arithmetic expressions like t.where((t.x * t.y) > 0)
- Unwrap Column operands during lazy expression fusion
- Add tests for string, arithmetic, computed-column, and transcendental predicates
- Document CTable.where() with signature and examples
- Remove leftover line_profiler hook from ctable.py

- Add NDArray-like Column metadata and boolean operators
- Document CTable.where boolean-expression semantics
- Clarify use of &, |, ~ versus Python and/or/not
- Modernize CTable examples and benchmarks to use column expressions
- Add tests for Column shape/size/ndim, boolean ops, and sum dtype

- Add StructSpec and blosc2.struct.
- Preserve optional schema metadata.
- Support Arrow schema metadata in CTable.from_arrow_batches.
- Support list<struct> import/export through ListArray.
- Add Parquet list<struct> round-trip test.

Also, nullable columns are available generally, although automatic null sentinel selection is primarily for Arrow/Parquet import paths.
- Add context-scoped NullPolicy for inferred null sentinels
- Support nullable=True in scalar CTable schema specs
- Add per-column column_null_values overrides
- Validate policy-derived null sentinels against column specs
- Simplify Arrow import API to CTable.from_arrow(schema, batches)
- Flush imported list columns batch-wise by default
- Update docs, examples, and tests for nullable CTable columns

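A small sketch of what the simplified Arrow import might look like is shown below; CTable.from_arrow(schema, batches) is the name given in the commit above, but the schema-argument style and the null-to-sentinel behaviour shown are assumptions, not verified signatures.

```python
# Hedged sketch: CTable.from_arrow(schema, batches) is named in the commit above, but
# the schema-argument style and null-to-sentinel mapping shown here are assumptions.
import numpy as np
import pyarrow as pa
from blosc2 import CTable

class Row:
    id: np.int64
    score: np.float64   # Arrow nulls would map to an inferred sentinel (e.g. NaN) per the NullPolicy notes

batch = pa.record_batch(
    [pa.array([1, 2, 3], type=pa.int64()), pa.array([0.5, None, 1.5])],
    names=["id", "score"],
)
t = CTable.from_arrow(Row, [batch])   # assumed: schema class plus an iterable of record batches
```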
- Add src/blosc2/objectarray.py; remove src/blosc2/vlarray.py
- Rename class VLArray → ObjectArray, function vlarray_from_cframe →
objectarray_from_cframe throughout the codebase
- On-disk metadata tag kept as "vlarray" for backward compatibility
with existing stored files
- Update all imports, type hints, docstrings, error messages, and
comments in core.py, schunk.py, list_array.py, dict_store.py,
embed_store.py, tree_store.py, indexing.py, msgpack_utils.py,
batch_array.py, and __init__.py
- Rename tests/test_vlarray.py → tests/test_objectarray.py
- Rename examples/vlarray.py → examples/objectarray.py; update
vlstore-lazyudf.py and ref-object.py
- Update bench/indexing/query_cache_store_bench.py
- Rename doc/reference/vlarray.rst → objectarray.rst; update
classes.rst, misc.rst, msgpack_serialization.rst, schunk.rst,
list_array.rst, tutorials.rst, and both tutorial notebooks
- Add VLStringSpec, VLBytesSpec to schema.py with vlstring()/vlbytes()
factory functions; dtype=None, nullable via native None (no sentinel)
- Add vlstring/vlbytes to schema_compiler: _KIND_TO_SPEC, display width,
spec_from_metadata_dict, validate_annotation_matches_spec
- Add src/blosc2/scalar_array.py: internal _ScalarVarLenArray adapter over
BatchArray with pending-buffer flush strategy, prefix-sum batch lookup,
and type-checked coercion for str/bytes/None
- Add create/open_varlen_scalar_column to TableStorage backends
- Integrate vlstring/vlbytes throughout ctable.py: _is_varlen_scalar_column
helper, Column.is_varlen_scalar, append/extend/flush/grow/compact/open/
save/load/add_column, Arrow export (vlstring→pa.string(),
vlbytes→pa.large_binary()), sort and index guards with clear errors
- Export vlstring/vlbytes in __init__.py
- Add tests/ctable/test_vlstring_vlbytes.py: 44 tests covering schema
round-trip, adapter internals, CTable CRUD, persistence (b2d/b2z),
nullable nulls, Arrow export, and sort/index guards
- ctable: _arrow_type_to_spec now maps scalar string/binary to vlstring/vlbytes
when string_max_length=None (the default); fixed-width path requires explicit
string_max_length; remove dead second `if string_max_length is None:` blocks
- ctable: from_arrow/from_parquet: varlen scalar columns bypass sentinel logic;
column_null_values raises TypeError for vlstring/vlbytes columns; auto_null_sentinels
skips varlen scalar columns (nulls represented as native None)
- off/parquet-to-blosc2.py: strip singleton-list machinery for scalar strings;
classify_columns now routes string→vlstring and binary→vlbytes; remove string
length scan, sampling, slack, restart, and --force-list-string options; update
print_import_plan and export path accordingly
- tests: update test_from_arrow_string_max_length → two tests covering vlstring
default and explicit fixed-width; fix list/.tolist() assertions in roundtrip
tests; rework null policy tests to reflect vlstring not using sentinels; add
three new round-trip tests covering long strings, binary, and Parquet import/
export without singleton-list wrapping
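A brief sketch of a schema using the new vlstring()/vlbytes() builders follows; the factory calls and null-as-None handling come from the commit notes above, while the new_data payload shape is an assumption.

```python
# Hedged sketch: vlstring()/vlbytes() factory calls and null-as-None behaviour follow the
# commit notes above; the new_data payload shape is an assumption, not verified API.
import numpy as np
import blosc2
from blosc2 import CTable

class Doc:
    id: np.int64
    title = blosc2.vlstring()   # variable-length string column; nulls stored as native None
    blob = blosc2.vlbytes()     # variable-length bytes column

d = CTable(Doc, new_data={
    "id": [1, 2],
    "title": ["short", "a much longer string that needs no fixed width"],
    "blob": [b"\x00\x01", None],   # None is stored directly; no sentinel is involved
})
```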
- Add final validation for vlstring/vlbytes schema options
- Tag BatchArray backends with ctable_varlen_scalar metadata
- Validate varlen scalar backend metadata on reopen
- Fix CTable constructor reopen path for vlstring/vlbytes columns
- Add explicit lazy/query guard errors for varlen scalar columns
- Extend vlstring/vlbytes persistence and guard tests
- Confirm full pytest suite passes

This adds columns to CTable that can host variable-length entities (typically large objects or lists of smaller objects).
It introduces a new ListArray object that centralizes variable-length handling.
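To give a feel for the standalone container, here is a hedged sketch of ListArray usage along the lines of tests/test_list_array.py (append, extend, indexing, Arrow export); the constructor arguments and the to_arrow() helper shown are assumptions rather than confirmed signatures.

```python
# Hedged sketch: ListArray construction arguments and the to_arrow() helper shown here
# are assumptions inferred from the file summaries and test names, not verified API.
import numpy as np
import blosc2

la = blosc2.ListArray(blosc2.list(dtype=np.float32), urlpath="floats.b2b", mode="w")
la.append([1.0, 2.0, 3.0])
la.extend([[], [4.0]])

print(len(la))    # 3 cells stored
print(la[0])      # -> [1.0, 2.0, 3.0]

arrow_col = la.to_arrow()   # hypothetical whole-column Arrow export
```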